Methods of genetic cluster analysis and uses thereof

ABSTRACT

The present invention is primarily directed to methods of genetic cluster analysis for use in determining the homogeneity and/or heterogeneity of a population or sub-population. Determination of the heterogeneity or homogeneity of a population sample is important in many areas including DNA fingerprinting in forensics and population-based studies such as clinical trials, case-control studies of risk factors, and gene mapping studies.

RELATED APPLICATIONS

This application is a divisional of U.S. patent application Ser. No. 09/693,333, filed Oct. 20, 2000, which claims the benefit of U.S. Provisional Patent Application No. 60/161,231, filed Oct. 22, 1999 and U.S. Provisional Patent Application No. 60/216,897, filed Jul. 7, 2000, both of which are hereby incorporated by reference herein in their entirety including any figures, drawings, or tables.

FIELD OF THE INVENTION

The present invention is in the field of applied genomics, and is primarily directed to methods of genetic cluster analysis to determine the homogeneity and/or heterogeneity of a population or sub-population.

BACKGROUND OF THE INVENTION

The following discussion is meant to aid in the understanding of the invention, but is not intended to, and is not admitted to, describe prior art to the invention.

Populations can vary considerably with respect to the number and frequency of genetic variants that they possess. Analogously, or consequently, individuals within such populations can vary considerably in terms of the composition of their genomes. Both of these facts contribute to the tremendous phenotypic variation exhibited among individuals within and between populations.

Various genetic systems and analyses have been used to assess the relationship between genes and phenotype (Lander & Schork, Science 265:2037-2048, 1994). Fundamental to such analyses is the assumption that the sample of individuals with (and/or without) a particular phenotype chosen for study is “homogenous” with respect to the cause of the phenotype (i.e. the individuals in the sample have (or don't have) the phenotype for some reason). When this is not the case, a relevant study of the relationship between the phenotype and its determinant(s) is unlikely to be successful. Assessment of the “similarity” of individuals in a sample with respect to genetic backgrounds and molecular profile may thus provide a useful measure of this homogeneity (Curnow, J. Agricul., Biol., And Environ Stat. 3:347-358, 1998). Such homogeneity assessment can be of value to any study that depends on the identification of individuals with certain features based on some distinguishing genetic characteristics, such as forensic applications.

Analyses assessing the similarity in the genetic profiles of individuals have been pursued. For example, polymorphic microsatellites (primarily CA repeats) have been used to construct trees of human individuals that reflect their geographic origin (Bowcock et al., Nature 368:455-457, 1994), and to study the genetic variability within and between cattle breeds (Ciampolini, et al., J. Anim. Sci. 73:3259-3268, 1995). RFLP genotypes have been used to construct trees of individuals of different ethnicities (Mountain and Cavalli-Sforza, Am. J. Hum. Genet. 61:705-718, 1997). Random amplified polymorphic DNA (RAPD) markers have been used to compute genetic similarity coefficients (Lamboy, PCR Methods and Applications 4:31-37, 1994), and to compare phenotype and genotype in plants (Jasienski, et al., Heredity 78:176-181, 1997).

However, these analyses often rely on a priori knowledge of the groups to which the individuals belong. Many do not permit the determination, in the absence of a priori knowledge, of which, and to what degree, different populations may have contributed to the genetic variation within a pool or sample of individuals. Yet in the large majority of cases, individuals sampled from a population represent an “admixture” of genes from several populations. These populations are reflected in the genetic profiles of individuals and hence can defy population segregation based on traditional markers such as skin color and/or self-reported ethnic affiliation. Therefore, methods of analysis are needed to accurately determine the existence of clusters of genetically similar individuals, absent phenotypic (ethnic, for example) information. As noted previously, knowledge of the homogeneity or heterogeneity of a population can be important under many circumstances including forensics and population-based studies.

In forensics, DNA fingerprinting requires the computation of ‘match probabilities’ between the suspect and the DNA obtained on a victim. Match probabilities are often computed relative to a database of non-suspect DNA. The utility of the DNA contributed by non-suspects will be influenced by the amount of genetic heterogeneity among the non-suspects (Jin & Chakraborty, Heredity 74:274-285, 1995; Sawyer et al., Am. J. Hum. Genet. 59:272-274, 1996; Tomsey et al., J. Forensic Sci. 44:385-388, 1999). Thus, determining the heterogeneity of the non-suspect population sample (on its own and compared with the DNA obtained on a victim) is important for a meaningful control.

In addition, many population-based studies, such as large clinical trials, case-control studies of disease risk factors, and gene mapping studies, assume that the populations under study are relatively homogenous genetically. When this assumption is erroneous, false inferences about the efficacy of a compound or the role of a particular risk factor in disease pathogenesis, for example, can result. Assessment of the heterogeneity of the population avoids misleading results.

SUMMARY OF THE INVENTION

A basic assumption is that genetic similarity should correlate with phenotypic similarity; there should be a relationship between genotype and phenotype at all relevant loci. The validity of this assumption for a given population has important implications for the accuracy and the understanding of the results of a given study, for example the efficacy of a compound or the role of a particular risk factor in disease pathogenesis, among others.

A corollary to this assumption is that a given population is homogeneous; the patient and control populations for instance should be genetically different from each other at loci which influence the trait studied, but should otherwise be a homogenous population. However, up to now there has been no good way to test the heterogeneity or homogeneity of a population at the genetic level; the assessment has been made at the phenotypic level. The methods of the invention allow the determination of the heterogeneity or homogeneity of a population at the genetic level.

In a first aspect, the invention features methods of clustering members of a sample, comprising:
a) identifying traits to be measured;
b) assigning values to said traits;
c) assigning weights based on the frequency of selected trait values in said sample;
d) forming an ordered set of similarity data by determining the similarity between pairs of members of said sample using said trait values and said weights;
e) calculating an ordered set of distance data from said ordered set of similarity data;
f) applying a hierarchical clustering algorithm to said distance matrix;
g) determining the optimal number of clusters based on the results of said hierarchical clustering algorithm, wherein said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster;
h) applying a non-hierarchical clustering algorithm to said ordered set of similarity data using said optimal number of clusters;
i) determining the relatedness between pairs of homozygous pairs by performing a paired-pair analysis on the clusters resulting from said non-hierarchical clustering algorithm, wherein the homozygous loci of two pairs are compared pairwise to determine whether the pairs share the same homozygous alleles on the same loci;
j) summing said paired-pair comparison for one pair versus all pairs in a cluster;
k) computing the average sum of said paired-pair comparison for all pairs in said cluster;
l) assigning values to the homozygous relatedness of each member of a pair to all homozygotes in said cluster based on whether said sum for one pair is greater than or equal to said average sum of all pairs, or whether said sum for one pair is less than said average sum for all pairs;
m) comparing the number of times said sum for one pair is greater than or equal to said average sum of all pairs with the number of times said sum for one pair is less than said average sum for all pairs for each individual in said cluster; and
n) dividing said cluster into a first cluster and a second cluster if there is: a first group of members of said cluster wherein said number of times said sum of one pair is less than said average sum of pairs is greater than or equal to said number of times said sum for one pair is greater than or equal to said average sum for all pairs, and a second group of members of said cluster wherein said number of times said sum of one pair is less than said average sum of pairs is less than said number of times said sum for one pair is greater than or equal to said average sum for all pairs, and wherein said first group of members are placed into said first cluster and said second group of members are placed into said second cluster.

In preferred embodiments of the invention, said traits are genetic loci and said values are assigned to said traits based on the alleles of said genetic loci. Preferably, said values are: 0 when a pair of members share no common allele; 1 when a pair of members share a common allele; and 2 when a pair of members share two common alleles. Preferably, said weights are assigned based on: sharing rare alleles between a pair of members; and sharing a homozygous genotype between a pair of members.

In other preferred embodiments, said ordered set of similarity data is present in a similarity matrix. Preferably, said similarity matrix is formed based on the pairwise similarity measure:

$S_{ij} = \sum_{k=1}^{L} \frac{a_{k,ij} \cdot w_{r,k,ij} \cdot w_{h,k,ij}}{2L}$

-   where i and j denote the particular members;
-   where k denotes a particular locus;
-   where L denotes the total number of loci;
-   where
    -   a_(k,ij) equals 0 for locus k when members i and j share no common allele,
    -   a_(k,ij) equals 1 for locus k when members i and j have one common allele, and
    -   a_(k,ij) equals 2 for locus k when members i and j have two common alleles;
-   where w_(r,k,ij) denotes a weight function for sharing rare alleles; and
-   where w_(h,k,ij) denotes a weight function for sharing a homozygous genotype.
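The pairwise similarity measure can be illustrated with a minimal Python sketch. The names `shared_alleles`, `unit_weight`, and `pairwise_similarity` are hypothetical, and two points are assumptions rather than part of the text above: alleles are counted as shared using a multiset intersection (so genotypes AA and AG share one allele), and both weight functions default to a placeholder that returns 1.0, because the functional forms of w_(r,k,ij) and w_(h,k,ij) are not specified in this passage.

```python
from collections import Counter


def shared_alleles(genotype_i, genotype_j):
    """Return a_(k,ij): the number of alleles (0, 1, or 2) two diploid genotypes share.

    Assumption: sharing is counted as a multiset intersection of the two alleles.
    """
    return sum((Counter(genotype_i) & Counter(genotype_j)).values())


def unit_weight(locus, genotype_i, genotype_j):
    """Hypothetical placeholder weight: every locus contributes equally."""
    return 1.0


def pairwise_similarity(member_i, member_j,
                        rare_weight=unit_weight,
                        homozygous_weight=unit_weight):
    """Compute S_ij as the sum over loci of a * w_r * w_h, divided by 2L.

    Each member is a list of genotypes, one (allele, allele) tuple per locus.
    """
    if not member_i or len(member_i) != len(member_j):
        raise ValueError("members must be genotyped at the same, non-empty set of loci")
    num_loci = len(member_i)
    total = 0.0
    for k, (g_i, g_j) in enumerate(zip(member_i, member_j)):
        a = shared_alleles(g_i, g_j)
        total += a * rare_weight(k, g_i, g_j) * homozygous_weight(k, g_i, g_j)
    return total / (2 * num_loci)


# Two members typed at two biallelic loci (illustrative data only).
member_1 = [("A", "A"), ("C", "T")]
member_2 = [("A", "G"), ("C", "T")]
print(pairwise_similarity(member_1, member_2))
```

With unit weights, two members with identical genotypes at every locus obtain a similarity of 1, and two members sharing no alleles obtain 0. Weight functions that favor rare alleles or shared homozygous genotypes can be passed in place of the placeholder.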

In other preferred embodiments, said ordered set of distance data is present in a distance matrix. Preferably, said distance matrix is generated using distances selected from the group consisting of Euclidean squared distance, Euclidean distance, absolute distance, and correlation coefficient. Most preferably, said distance matrix is generated using Euclidean squared distance.
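One possible way to obtain a Euclidean squared distance matrix from a similarity matrix is sketched below in Python. It assumes, and this is an assumption not stated in the text, that each member is represented by its row of the similarity matrix, so that the distance between members i and j is the sum of squared differences between rows i and j. The function name `squared_euclidean_distance_matrix` is hypothetical.

```python
def squared_euclidean_distance_matrix(similarity):
    """Build a symmetric distance matrix from a square similarity matrix.

    Assumption: member i is described by row i of the similarity matrix, and
    d_ij is the sum over m of (S_im - S_jm) squared.
    """
    n = len(similarity)
    if any(len(row) != n for row in similarity):
        raise ValueError("similarity matrix must be square")
    distance = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum((similarity[i][m] - similarity[j][m]) ** 2 for m in range(n))
            distance[i][j] = d
            distance[j][i] = d
    return distance


# Illustrative 3 x 3 similarity matrix: members 0 and 1 are similar, member 2 is not.
example_similarity = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
example_distance = squared_euclidean_distance_matrix(example_similarity)
```

In this example, the distance between members 0 and 1 is smaller than the distance between either of them and member 2, which is the ordering a clustering algorithm would be expected to exploit.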

In yet other preferred embodiments, said hierarchical clustering algorithm produces a dendrogram based on calculations selected from the group consisting of minimum variance, nearest neighbor, furthest neighbor, group average, and median computations. Preferably, said hierarchical clustering algorithm produces a dendrogram based on minimum variance calculations.

In still other preferred embodiments, said non-hierarchical clustering algorithm uses K-means clustering. Preferably, said K-means clustering uses classifiers selected from the group consisting of sum of squares, nearest neighbor method, weighted nearest neighbor method, and pseudo nearest neighbor combined classifier. Most preferably, said classifier is sum of squares.

A second aspect of the invention features methods of determining the presence of clusters in a sample, comprising: a) assigning weights to selected trait values of members of said sample based on the frequency of said selected trait values in said sample; b) forming an ordered set of similarity data by determining the similarity between pairs of members of said sample using said weights; and c) applying a clustering algorithm to said ordered set of similarity data.

In preferred embodiments of the invention, said weights are assigned based on: sharing rare alleles between a pair of members; and sharing a homozygous genotype between a pair of members.

In other preferred embodiments of the invention, said ordered set of similarity data is present in a similarity matrix. Preferably, said similarity matrix is formed based on the pairwise similarity measure:

$S_{ij} = \sum_{k=1}^{L} \frac{a_{k,ij} \cdot w_{r,k,ij} \cdot w_{h,k,ij}}{2L}$

-   where i and j denote the particular members;
-   where k denotes a particular locus;
-   where L denotes the total number of loci;
-   where
    -   a_(k,ij) equals 0 for locus k when members i and j share no common allele,
    -   a_(k,ij) equals 1 for locus k when members i and j have one common allele, and
    -   a_(k,ij) equals 2 for locus k when members i and j have two common alleles;
-   where w_(r,k,ij) denotes a weight function for sharing rare alleles; and
-   where w_(h,k,ij) denotes a weight function for sharing a homozygous genotype.

In other preferred embodiments, the invention further comprises calculating an ordered set of distance data from said ordered set of similarity data. Preferably, said ordered set of distance data is present in a distance matrix. Preferably, said distance matrix is generated using distances selected from the group consisting of Euclidean squared distance, Euclidean distance, absolute distance, and correlation coefficient. Most preferably, said distance matrix is generated using Euclidean squared distance.

In still other preferred embodiments, said clustering algorithm is a hierarchical clustering algorithm. Preferably, said hierarchical clustering algorithm produces a dendrogram based on calculations selected from the group consisting of minimum variance, nearest neighbor, furthest neighbor, group average, and median computations. Most preferably, said hierarchical clustering algorithm produces a dendrogram based on minimum variance calculations.

In yet other preferred embodiments, said clustering algorithm is a non-hierarchical clustering algorithm. Preferably, said non-hierarchical clustering algorithm uses K-means clustering. Preferably, said K-means clustering uses classifiers selected from the group consisting of sum of squares, nearest neighbor method, weighted nearest neighbor method, and pseudo nearest neighbor combined classifier. Most preferably, said classifier is sum of squares.

A third aspect of the invention features methods of determining the number of clusters in a sample, comprising: a) applying a hierarchical clustering algorithm to the members of said sample; and b) determining the optimal number of clusters based on the results of said hierarchical clustering algorithm, wherein said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.
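The criterion for the optimal number of clusters can be sketched in Python as a test applied to a single merge of the hierarchical clustering. The sketch rests on two assumptions that the text does not spell out: the average for the two parent clusters is taken over the pooled within-cluster pairs of both parents, and a merge of two single-member clusters (which have no within-cluster pairs) is never treated as the stopping point. The function names are hypothetical.

```python
from itertools import combinations


def intracluster_similarities(cluster, similarity):
    """List the similarity of every pair of members within one cluster."""
    return [similarity[i][j] for i, j in combinations(cluster, 2)]


def merge_lowers_similarity(parent_a, parent_b, similarity):
    """Return True when merging two parent clusters reduces the average
    pairwise intracluster similarity, i.e. when the parents should be kept apart.

    Assumption: the parent average pools the within-cluster pairs of both parents.
    """
    parent_pairs = (intracluster_similarities(parent_a, similarity)
                    + intracluster_similarities(parent_b, similarity))
    if not parent_pairs:
        # Two singletons: no parent average exists, so do not stop here.
        return False
    merged_pairs = intracluster_similarities(list(parent_a) + list(parent_b), similarity)
    parent_average = sum(parent_pairs) / len(parent_pairs)
    merged_average = sum(merged_pairs) / len(merged_pairs)
    return parent_average > merged_average


# Illustrative similarity matrix: members 0-1 and members 2-3 form two tight groups.
example_similarity = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
print(merge_lowers_similarity([0, 1], [2, 3], example_similarity))
```

Walking up the dendrogram from the leaves and stopping at the first merge for which this test is true would leave the parent clusters of that merge, together with the other clusters present at that level, as the candidate optimal partition.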

A fourth aspect of the invention features methods of clustering members of a sample, comprising:
a) applying a non-hierarchical clustering algorithm to said sample;
b) determining the relatedness between pairs of homozygous pairs by performing a paired-pair analysis on the clusters resulting from said non-hierarchical clustering algorithm, wherein the homozygous loci of two pairs are compared pairwise to determine whether the pairs share the same homozygous alleles on the same loci;
c) summing said paired-pair comparison for one pair versus all pairs in a cluster;
d) computing the average sum of said paired-pair comparison for all pairs in said cluster;
e) assigning values to the homozygous relatedness of each individual in a pair to all homozygotes in said cluster based on whether said sum for one pair is greater than or equal to said average sum of all pairs, or whether said sum for one pair is less than said average sum for all pairs;
f) comparing the number of times said sum for one pair is greater than or equal to said average sum of all pairs with the number of times said sum for one pair is less than said average sum for all pairs for each individual in said cluster; and
g) dividing said cluster into a first cluster and a second cluster if there is: a first group of individuals within said cluster wherein said number of times said sum of one pair is less than said average sum of pairs is greater than or equal to said number of times said sum for one pair is greater than or equal to said average sum for all pairs, and a second group of individuals within said cluster wherein said number of times said sum of one pair is less than said average sum of pairs is less than said number of times said sum for one pair is greater than or equal to said average sum for all pairs, and wherein said first group of individuals are placed into said first cluster and said second group of individuals are placed into said second cluster.

In a fifth aspect, the invention features a method of clustering members of a sample, comprising: applying a hierarchical clustering algorithm to said members of said sample; determining the optimal number of clusters based on the results of said hierarchical clustering algorithm; and distributing said members of said sample into said optimal number of clusters using non-hierarchical clustering. Preferably, said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

In preferred embodiments, this method of clustering further comprises: applying paired-pair analysis to said members in said optimal number of clusters to determine whether said clusters should be further divided; and distributing said members into additional clusters based on the results of said paired-pair analysis. Preferably, said paired-pair analysis compares the homozygous or heterozygous loci of pairs pairwise.

In other preferred embodiments, this method of clustering further comprises constructing a similarity matrix prior to said applying step. Preferably, said similarity matrix is formed based on a pairwise similarity measure. Preferably, said pairwise similarity measure is of values for genetic loci.

In other preferred embodiments, this method of clustering further comprises constructing a distance matrix from said similarity matrix prior to said applying step.

In a sixth aspect, the invention features a method of determining the optimal number of clusters of members in a sample, comprising: applying a hierarchical clustering algorithm to said members of said sample; and determining the optimal number of clusters based on the results of said hierarchical clustering algorithm. Preferably, said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

In a seventh aspect, the invention features a method of clustering members of a sample, comprising: applying a non-hierarchical clustering algorithm to said sample; applying paired-pair analysis to pairs of homozygous pairs in any clusters resulting from said non-hierarchical clustering algorithm to determine whether said non-hierarchical clusters should be further divided; and distributing said members into additional clusters based on the results of said paired-pair analysis. Preferably, said paired-pair analysis compares the homozygous or heterozygous loci of pairs pairwise.

In an eighth aspect, the invention features a computer system for determining clusters of members in a sample, comprising: first instructions for applying a hierarchical clustering algorithm to said members of said sample; second instructions for determining the optimal number of clusters based on the results of said hierarchical clustering algorithm; and third instructions for distributing said members of said sample into said optimal number of clusters using non-hierarchical clustering. Preferably, said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

In preferred embodiments, this computer system further comprises fourth instructions for applying paired-pair analysis to said members of said optimal number of clusters to determine whether said clusters should be further divided; and fifth instructions for distributing said members into additional clusters based on the results of said paired-pair analysis. Preferably, said paired-pair analysis compares the homozygous or heterozygous loci of pairs pairwise.

In preferred embodiments, this computer system further comprises sixth instructions for constructing a similarity matrix prior to said applying step. Preferably, said similarity matrix is formed based on a pairwise similarity measure. Preferably, said pairwise similarity measure is of values for genetic loci.

In other preferred embodiments, this computer system further comprises seventh instructions for constructing a distance matrix from said similarity matrix prior to said applying step.

In a ninth aspect, the invention features a computer system for determining the optimal number of clusters in a sample, comprising: a first module configured to apply a hierarchical clustering algorithm to the members of said sample; and a second module configured to determine the optimal number of clusters based on the results of said hierarchical clustering algorithm. Preferably, said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

In a tenth aspect, the invention features a computer system for clustering members of a sample, comprising: first instructions for applying a non-hierarchical clustering algorithm to said sample; second instructions for applying paired-pair analysis to pairs of homozygous pairs in any clusters resulting from said non-hierarchical clustering algorithm to determine whether said non-hierarchical clusters should be further divided; and third instructions for distribution of said members into additional clusters based on the results of said paired-pair analysis. Preferably, said paired-pair analysis compares the homozygous or heterozygous loci of pairs pairwise.

In an eleventh aspect, the invention features a programmed storage device comprising instructions that when executed perform a method for clustering members of a sample, comprising: applying a hierarchical clustering algorithm to said members of said sample; determining the optimal number of clusters based on the results of said hierarchical clustering algorithm; and distributing said members of said sample into said optimal number of clusters using non-hierarchical clustering. Preferably, said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

In preferred embodiments, the programmed storage device further comprises instructions that when executed apply paired-pair analysis to said members of said optimal number of clusters to determine whether said clusters should be further divided; and distribute said members into additional clusters based on the results of said paired-pair analysis. Preferably, said paired-pair analysis compares the homozygous or heterozygous loci of pairs pairwise.

In preferred embodiments, the programmed storage device further comprises instructions that when executed construct a similarity matrix prior to said applying step. Preferably, said similarity matrix is formed based on a pairwise similarity measure. Preferably, said pairwise similarity measure is of values for genetic loci.

In preferred embodiments, the programmed storage device further comprises instructions that when executed construct a distance matrix from said similarity matrix prior to said applying step.

In a twelfth aspect, the invention features a programmed storage device comprising instructions that when executed perform a method for determining the number of clusters in a sample, comprising: applying a hierarchical clustering algorithm to the members of said sample; and determining the optimal number of clusters based on the results of said hierarchical clustering algorithm. Preferably, said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

In a thirteenth aspect, the invention features a programmed storage device comprising instructions that when executed perform a method for clustering members of a sample, comprising: applying a non-hierarchical clustering algorithm to said sample; applying paired-pair analysis to pairs of homozygous pairs in any clusters resulting from said non-hierarchical clustering algorithm to determine whether said non-hierarchical clusters should be further divided; and distributing said members into additional clusters based on the results of said paired-pair analysis. Preferably, said paired-pair analysis compares the homozygous or heterozygous loci of pairs pairwise.

Additional embodiments are set forth in the Detailed Description of the Invention, in the Implementation of the Methods of the Invention section, and in the Examples.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a graph of the optimal number of clusters for 475 ethnically diverse individuals using the genetic clustering analysis of the invention based on SNP genotyping at 15 loci, 50 loci and 65 loci.

FIG. 2 shows a bar graph of the ethnic affiliation of each individual within Cluster 1 for the clustering analysis performed with 15, 50, and 65 loci.

FIG. 3 shows a bar graph of the ethnic affiliation of each individual within Cluster 2 for the clustering analysis performed with 15, 50, and 65 loci.

FIG. 4 shows a bar graph of the ethnic affiliation of each individual within Cluster 3 for the clustering analysis performed with 15, 50, and 65 loci.

FIG. 5 shows a bar graph of the ethnic affiliation of each individual within Cluster 4 for the clustering analysis performed with 15, 50, and 65 loci.

FIG. 6 shows a bar graph of the ethnic affiliation of each individual within Cluster 5 for the clustering analysis performed with 15, 50, and 65 loci.

FIG. 7 shows a bar graph of the ethnic affiliation of each individual within Cluster 6 for the clustering analysis performed with 15, 50, and 65 loci.

FIG. 8 shows a bar graph of the ethnic affiliation of each individual within Cluster 7, Cluster 8, and Cluster 9 for the clustering analysis performed with 65 loci.

FIG. 9 shows a graphical representation of the results of the clustering analysis performed separately on each ethnic group using 15, 50, and 65 loci.

FIGS. 10, 11 & 12 show the identification of particular clusters, obtained from analyzing each ethnic group separately, by ethnic subpopulations from clusters obtained by analyzing all the ethnic groups as a total population. The shaded bars represent the separation into clusters of each ethnic group run separately with 15 or 50 loci (same data as that present in FIG. 9). The black bars represent the location of the individuals separated into clusters when the total population was analyzed with 15 or 50 loci (data present in FIGS. 2-7). FIG. 10 shows the information for Cluster 1 and Cluster 2 (from FIGS. 2 and 3, respectively). FIG. 11 shows the information for Cluster 3 and Cluster 4 (FIGS. 4 and 5, respectively). FIG. 12 shows the information for Cluster 5 and Cluster 6 (FIGS. 6 and 7, respectively).

FIG. 13 shows a graphical representation of the results of the clustering analysis for the combined population (328 individuals: case and two controls).

FIG. 14 shows a graphical representation of the proportion of individuals from case and control in each cluster.

FIGS. 15, 16, and 17 show the identification of particular clusters, obtained from analyzing case and control populations separately, by case and control subpopulations from clusters identified by analyzing the case and control populations together. FIG. 15 shows the analysis for Cluster 1, Cluster 2, and Cluster 3. FIG. 16 shows the analysis for Cluster 4, Cluster 5, and Cluster 6. FIG. 17 shows the analysis for Cluster 7 and Cluster 8.

FIG. 18 shows a diagram of an automated system.

FIG. 19 shows a diagram of an alternative automated system.

FIG. 20 shows a diagram of the process of clustering members of a population.

FIG. 21 shows a diagram of the process of determining the optimal number of clusters in a sample.

FIG. 22 shows a diagram of the process of paired-pair analysis.

DETAILED DESCRIPTION OF THE INVENTION

Prior to describing the invention in more detail, the following definitions are provided for clarification.

I. Definitions

The terms “trait” and “phenotype” are used interchangeably herein and refer to any visible, detectable or otherwise measurable property of an organism such as, for example, symptoms of, or susceptibility to, a disease.

The term “allele” is used herein to refer to variants of a nucleotide sequence. A biallelic polymorphism has two forms. Typically the first identified allele is designated as the original allele whereas other alleles are designated as alternative alleles. Diploid organisms may be homozygous or heterozygous for an allelic form.

The term “heterozygosity rate” is used herein to refer to the incidence of individuals in a population who are heterozygous at a particular allele. In a biallelic system the heterozygosity rate is on average equal to 2P_(a)(1−P_(a)), where P_(a) is the frequency of the least common allele. In order to be useful in genetic studies a genetic marker should have an adequate level of heterozygosity to allow a reasonable probability that a randomly selected person will be heterozygous.
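As a brief worked illustration of the expression 2P_(a)(1−P_(a)), the following Python sketch evaluates the average heterozygosity rate for a given frequency of the least common allele. The function name `heterozygosity_rate` is hypothetical, and the range check reflects the fact that the least common allele of a biallelic marker cannot have a frequency above 0.5.

```python
def heterozygosity_rate(least_common_allele_frequency):
    """Return 2 * P_a * (1 - P_a) for a biallelic marker."""
    p_a = least_common_allele_frequency
    if not 0.0 <= p_a <= 0.5:
        raise ValueError("frequency of the least common allele must lie in [0, 0.5]")
    return 2.0 * p_a * (1.0 - p_a)


# The rate rises with the frequency of the least common allele and peaks at 0.5.
for frequency in (0.01, 0.1, 0.3, 0.5):
    print(frequency, heterozygosity_rate(frequency))
```

A marker whose least common allele has a frequency of 0.5 reaches the maximum rate of 0.5, whereas a marker near the 1% frequency floor is heterozygous in only about 2% of individuals, which is why markers with higher heterozygosity are generally more useful in genetic studies.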

The term “genotype” as used herein refers to the identity of the alleles present in an individual or a sample. In the context of the present invention a genotype preferably refers to the description of the biallelic marker alleles present in an individual or a sample. The term “genotyping” a sample or an individual for a biallelic marker consists of determining the specific allele or the specific nucleotide carried by an individual at a biallelic marker.

The term “polymorphism” as used herein refers to the occurrence of two or more alternative genomic sequences or alleles between or among different genomes or individuals. “Polymorphic” refers to the condition in which two or more variants of a specific genomic sequence can be found in a population. A “polymorphic site” is the locus at which the variation occurs. A single nucleotide polymorphism is a single base pair change. Typically a single nucleotide polymorphism is the replacement of one nucleotide by another nucleotide at the polymorphic site. Deletion of a single nucleotide or insertion of a single nucleotide also gives rise to single nucleotide polymorphisms. In the context of the present invention “single nucleotide polymorphism” preferably refers to a single nucleotide substitution. Typically, between different genomes or between different individuals, the polymorphic site may be occupied by two different nucleotides.

A database includes indexed and freeform tables for storing data. Withineach table are a series of fields that store data strings, such asnames, addresses, chemical names, and the like. However, it should berealized that several types of databases are available. For example, adatabase might only include a list of data strings arranged in a column.Other databases might be relational databases wherein several twodimensional tables are linked through common fields. Embodiments of theinvention are not limited to any particular type of database.

An input device can be, for example, a keyboard, rollerball, mouse, voice recognition system, automated script from another computer that generates a file, or other device capable of transmitting information from a customer to a computer. The input device can also be a touch screen associated with the display, in which case the customer responds to prompts on the display by touching the screen. The customer may enter textual information through the input device such as the keyboard or the touch screen.

Instructions refer to computer-implemented steps for processing information in the system. Instructions can be implemented in software, firmware or hardware and include any type of programmed step undertaken by components and modules of the system.

One example of a Local Area Network may be a corporate computing network, including access to the Internet, to which computers and computing devices comprising the system are connected. In one embodiment, the LAN conforms to the Transmission Control Protocol/Internet Protocol (TCP/IP) industry standard. In alternative embodiments, the LAN may conform to other network standards, including, but not limited to, the International Standards Organization's Open Systems Interconnection, IBM's SNA, Novell's Netware, and Banyan VINES.

A microprocessor as used herein may be any conventional general purpose single- or multi-chip microprocessor such as a Pentium® processor, a Pentium® Pro processor, an 8051 processor, a MIPS® processor, a Power PC® processor, or an ALPHA® processor. In addition, the microprocessor may be any conventional special purpose microprocessor such as a digital signal processor or a graphics processor. The microprocessor typically has conventional address lines, conventional data lines, and one or more conventional control lines.

SNPs, as used herein, refer to biallelic markers, which are genome-derived polynucleotides that exhibit biallelic polymorphism. As used herein, the term biallelic marker means a biallelic single nucleotide polymorphism. As used herein, the term polymorphism may include a single base substitution, insertion, or deletion. By definition, the lowest allele frequency of a biallelic polymorphism is 1% (sequence variants which show allele frequencies below 1% are called rare mutations). There are potentially more than 10⁷ biallelic markers which can easily be typed by routine automated techniques, such as sequence- or hybridization-based techniques, out of which 10⁶ are sufficiently informative for mapping purposes.

The system is comprised of various modules as discussed in detail below. As can be appreciated by one of ordinary skill in the art, each of the modules comprises various sub-routines, instructions, commands, procedures, definitional statements and macros. Each of the modules is typically separately compiled and linked into a single executable program. Therefore, the following description of each of the modules is used for convenience to describe the functionality of the preferred system. Thus, the processes that are undergone by each of the modules may be arbitrarily redistributed to one of the other modules, combined together in a single module, or made available in, for example, a shareable dynamic link library.

The system may include any type of electronically connected group of computers including, for instance, the following networks: Internet, Intranet, Local Area Networks (LAN) or Wide Area Networks (WAN). In addition, the connectivity to the network may be, for example, remote modem, Ethernet (IEEE 802.3), Token Ring (IEEE 802.5), Fiber Distributed Data Interface (FDDI) or Asynchronous Transfer Mode (ATM). Note that computing devices may be desktop, server, portable, hand-held, set-top, or any other desired type of configuration. As used herein, an Internet includes network variations such as a public internet, a private internet, a secure internet, a private network, a public network, a value-added network, an intranet, and the like.

The system may be used in connection with various operating systems such as: UNIX, Disk Operating System (DOS), OS/2, Windows 3.X, Windows 95, Windows 98, Windows 2000 and Windows NT.

The various software aspects of the system may be written in any programming language or combination of languages such as C, C++, BASIC, Pascal, Perl, Java, and FORTRAN and run under a well-known operating system. C, C++, BASIC, Pascal, Java, and FORTRAN are industry standard programming languages for which many commercial compilers can be used to create executable code.

A system is one or more computers and associated peripherals that carry out selected functions. For example, a Customer system includes the computer hardware, software and firmware for executing the specific software instructions described below. A system should not be interpreted as being limited to a single computer or microprocessor, and may include a network of computers, or a computer having multiple microprocessors.

Transmission Control Protocol (TCP) is a transport layer protocol used to provide a reliable, connection-oriented, transport layer link among computer systems. The network layer provides services to the transport layer. Using a three-way handshaking scheme, TCP provides the mechanism for establishing, maintaining, and terminating logical connections among computer systems. The TCP transport layer uses IP as its network layer protocol. Additionally, TCP provides protocol ports to distinguish multiple programs executing on a single device by including the destination and source port number with each message. TCP performs functions such as transmission of byte streams, data flow definitions, data acknowledgments, lost or corrupt data re-transmissions and multiplexing multiple connections through a single network connection. Finally, TCP is responsible for encapsulating information into a datagram structure.

A programmed storage device can be any of a series of well-known programmable storage devices. Examples include Electrically Erasable Programmable Read Only Memories (EEPROMs) or other programmable memories known in the art. Other examples of programmed storage devices include hard disks, floppy disks, conventional memory, and the like.

2. The Invention

A basic assumption is that genetic similarity should correlate with phenotypic similarity at relevant genetic sites; there should be a relationship between genotype and phenotype. The validity of this assumption for a given population has important implications for the accuracy and the understanding of the results of a given study, for instance in studies of the genetic basis of ethnicity (Example 1) or studies to determine a genetic difference between a patient population and a control population (Example 2), among others.

A corollary to this assumption is that a given population is homogeneous; the patient and control populations, for instance, should be genetically different from each other in the trait studied and the loci of relevance to the determination of this trait, but should otherwise be a homogeneous population. What is needed are methods to determine a priori the genetic similarity of members of a population; the number of potential clusters of genetically similar individuals or subpopulations contributing to the putative heterogeneity of a sample should be determined. If no such clusters exist, the sample is homogeneous. If such clusters exist, they can and should be accounted for in an analysis of the sample.

However, up to now, the methods to determine the number of “clusters” in a population require that the number of clusters in a population is already known. Basically, given a population, one could say there should be three clusters (for example, one for Black Americans, one for Caucasians, and one for Chinese Americans), run the clustering algorithm and derive an answer. However, it was impossible to take a population containing Black Americans, Caucasians, and Chinese Americans and determine how many genetically distinct groups really existed: one, two, three, or more.

Problem to be Solved:

Approaches for cluster analysis can be classified into two types: hierarchical and non-hierarchical. In the end, the clustering method of the invention will use both hierarchical and non-hierarchical analyses. But first, a hierarchical cluster analysis is performed and used as the basis for the algorithm to determine the optimal number of clusters for the data.

Hierarchical cluster analysis produces a series of overlapping groups. The end product of a wide variety of hierarchical classification procedures is a dendrogram; these are the classical “trees”. This is a nested collection of clusters with two trivial extremes: the singleton and the whole set. The clusters at one level are constructed from the clusters at the previous level.

There are two basic approaches to hierarchical cluster analysis: agglomerative methods, which build up clusters starting from individuals until there is only one cluster, and divisive methods, which start with a single cluster and split clusters until the individual level is reached. Hierarchical cluster analysis can be represented by a tree (dendrogram) that shows at which distance the clusters merge. To every dendrogram there corresponds a unique ultrametric dissimilarity matrix.

The stages in a hierarchical analysis are usually as follows:

    -   1. Form a distance matrix;
    -   2. Use a selected criterion to form a hierarchy; and
    -   3. Form a set of clusters and/or a dendrogram.

A crucial problem in interpreting tree diagrams (hierarchical clustering) is deciding how many clusters there are from the resulting tree. If there are 100 individuals in a group, the tree diagram has all the individuals in one cluster at one end and each individual in a separate cluster (i.e. 100 clusters) at the other end. Somewhere in between is the optimum number of clusters for the data. Typically, the researcher determines the expected number of clusters based on phenotypic data, for example, and chooses the level of the dendrogram containing this many clusters. However, no matter how the decision is made, it is relatively arbitrary, and not necessarily reflective of the underlying data.

Solution to the Problem:

Thus, to assess the inherent groups in a population a new method was necessary. Although the method of the invention was developed by comparing the pairwise similarity of alleles, it can be applied to any trait in a population to which weighted values can be assigned to determine the heterogeneity or homogeneity of the population for the trait.

Weight Function:

The first step was to identify a weight function for the sharing of alleles between members of a population.

A measure of the pairwise similarity between individuals in a sample can be defined as:

$S_{ij} = \sum\limits_{k = 1}^{L}\frac{a_{k,ij} \cdot w_{r,k,ij} \cdot w_{h,k,ij}}{2L}$

-   where i and j denote the particular individuals;
-   where k denotes a particular locus;
-   where L denotes the total number of loci;
-   where
    -   a_(k,ij) equals 0 for locus k when individuals i and j share no common allele,
    -   a_(k,ij) equals 1 for locus k when individuals i and j have one common allele, and
    -   a_(k,ij) equals 2 for locus k when individuals i and j have two common alleles;
-   where w_(r,k,ij) denotes a weight function for sharing rare alleles; and
-   where w_(h,k,ij) denotes a weight function for sharing a homozygous genotype at the locus.

The weighting will vary depending on the comparison that you are making and the desired outcome. It can be given as discrete numbers, for instance 30 and 30 as used in Examples 2 and 3 herein, or it can be written as a function. For instance, if you are looking specifically for individuals with a rare allele, the presence of this shared allele could be given a very high value, 2,000 for example. The weighting can also be described as a function, which in this case should be dependent on the number of loci, among other things. For different studies, different weight functions would be appropriate and could be determined by those of ordinary skill in the art.
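The similarity measure can be sketched in Python as follows. This is only an illustration under stated assumptions: genotypes are represented as pairs of allele symbols, one pair per locus; the rare allele at each locus is supplied by the caller; and a weight is taken to be 1 whenever it does not apply, a convention the text does not spell out. The function names and the example data are hypothetical.

```python
from collections import Counter


def shared_alleles(g1, g2):
    """a_k,ij: number of alleles (0, 1 or 2) that two genotypes have in common."""
    return sum((Counter(g1) & Counter(g2)).values())


def similarity(ind_i, ind_j, rare_alleles, w_rare=30.0, w_homo=30.0):
    """Pairwise similarity S_ij summed over all loci.

    Assumption: w_r and w_h equal 1 when no rare allele is shared or when the
    pair does not share a homozygous genotype at the locus.
    """
    n_loci = len(ind_i)
    total = 0.0
    for g_i, g_j, rare in zip(ind_i, ind_j, rare_alleles):
        common = Counter(g_i) & Counter(g_j)
        a = sum(common.values())
        w_r = w_rare if rare in common else 1.0
        same_homozygote = g_i[0] == g_i[1] and sorted(g_i) == sorted(g_j)
        w_h = w_homo if same_homozygote else 1.0
        total += a * w_r * w_h / (2 * n_loci)
    return total


if __name__ == "__main__":
    # Two individuals typed at two loci (made-up genotypes).
    ind_1 = [("A", "A"), ("C", "T")]
    ind_2 = [("A", "A"), ("C", "C")]
    rare = ["G", "T"]  # assumed rare allele at each locus
    print(similarity(ind_1, ind_2, rare))
```

With the weights of 30 mentioned in the text, a shared homozygous genotype dominates the score, so the choice of weights strongly shapes which pairs count as similar.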

Distance Matrix:

A distance matrix can then be generated from the pairwise similarity matrix for the individuals through a number of means, one of which is the Euclidean squared distance:

$d_{jk} = \sum\limits_{i = 1}^{k}\left( S_{ji} - S_{ki} \right)^{2}$

A suitable distance matrix provides a measure of how similar individuals are to each other, in this case genetically. Common distances that are used and are appropriate for use in the methods of the invention include, but are not limited to, Euclidean distance, Euclidean squared distance, absolute distance (city block metric), and a correlation coefficient.
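A minimal Python sketch of the Euclidean squared distance follows. It treats each row of the similarity matrix as the profile of one individual and assumes the sum runs over all individuals in the sample; the function name and the example matrix are hypothetical.

```python
def squared_euclidean_distances(S):
    """Distance matrix d_jk = sum_i (S_ji - S_ki)^2 from a similarity matrix S.

    Assumption: the sum runs over every individual i in the sample.
    """
    n = len(S)
    return [
        [sum((S[j][i] - S[k][i]) ** 2 for i in range(n)) for k in range(n)]
        for j in range(n)
    ]


if __name__ == "__main__":
    # Made-up similarity matrix for three individuals.
    S = [
        [1.0, 0.8, 0.1],
        [0.8, 1.0, 0.2],
        [0.1, 0.2, 1.0],
    ]
    for row in squared_euclidean_distances(S):
        print([round(d, 2) for d in row])
```

Individuals 0 and 1 have similar profiles and end up close together, while individual 2 is far from both.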

Hierarchical Clustering:

Subsequently, agglomerative (hierarchical) clustering methods are applied to produce a dendrogram (or tree). The tree starts with n clusters, each with a single individual, and then at each subsequent step (n−1, n−2, n−3, etc.) two clusters merge to form a larger cluster until all the individuals are present in a single cluster. To determine which two clusters merge at each subsequent step, the clustering algorithm calculates the minimum variance within each cluster based on the distance between the pairwise similarity of individuals in the cluster based on the distance matrix (described above):

$d_{i,jk} = \frac{\left( n_{i} + n_{j} \right)d_{ij} + \left( n_{i} + n_{k} \right)d_{ik} - n_{i}d_{jk}}{n_{i} + n_{j} + n_{k}}$

where n_(i), n_(j), n_(k) are the numbers of individuals in clusters i, j, k. The two clusters that, when combined, produce the minimum within-cluster variance are then combined to form a single cluster, and the clustering algorithm then re-calculates the minimum variance and the process continues until there is only one cluster. Alternatives to using a measurement of the minimum variance include nearest neighbour, furthest neighbour, group average, or median computations, for example, or any other method typically used in the art.
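The update formula above can be sketched in Python as shown below. The function `ward_update` is a direct transcription of the formula; the surrounding loop `agglomerate` is an assumed, simplified driver that always merges the pair of clusters with the smallest current distance. All names and the example data are hypothetical.

```python
def ward_update(d_ij, d_ik, d_jk, n_i, n_j, n_k):
    """Distance between cluster i and the cluster formed by merging j and k."""
    return ((n_i + n_j) * d_ij + (n_i + n_k) * d_ik - n_i * d_jk) / (n_i + n_j + n_k)


def agglomerate(D):
    """Merge clusters until one remains.

    Returns a list of (members_a, members_b, merge_distance) tuples.
    Assumption: the closest pair under the current distances is merged each step.
    """
    clusters = {i: [i] for i in range(len(D))}
    dist = {(i, j): D[i][j] for i in clusters for j in clusters if i < j}
    merges = []
    next_id = len(D)
    while len(clusters) > 1:
        (j, k), d_jk = min(dist.items(), key=lambda item: item[1])
        updated = {}
        for i in clusters:
            if i in (j, k):
                continue
            d_ij = dist[tuple(sorted((i, j)))]
            d_ik = dist[tuple(sorted((i, k)))]
            updated[i] = ward_update(
                d_ij, d_ik, d_jk, len(clusters[i]), len(clusters[j]), len(clusters[k])
            )
        merges.append((clusters[j], clusters[k], d_jk))
        merged = clusters.pop(j) + clusters.pop(k)
        # Drop distances that involve the two clusters that were just merged.
        dist = {p: v for p, v in dist.items() if j not in p and k not in p}
        for i, value in updated.items():
            dist[(i, next_id)] = value  # next_id exceeds every existing id
        clusters[next_id] = merged
        next_id += 1
    return merges


if __name__ == "__main__":
    # Squared distances between four points on a line at 0, 1, 10 and 11.
    D = [
        [0, 1, 100, 121],
        [1, 0, 81, 100],
        [100, 81, 0, 1],
        [121, 100, 1, 0],
    ]
    for a, b, d in agglomerate(D):
        print(a, "+", b, "at distance", d)
```

The returned merge list is the information a dendrogram displays: which clusters joined and at what distance.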

Optimal Cluster Number Determination:

The results of the hierarchical clustering algorithm are then used to determine the optimal number of clusters. The optimal number of clusters is found at the point in the tree where the average of the pairwise similarity for the individuals in the newly formed cluster (average intracluster similarity) is smaller than the average of the pairwise similarity for the individuals in either of the two parent clusters. This can also be seen as the point where the average pairwise intracluster similarity for either of two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

The following criterion is applied as an indicator for the number of clusters present in the data:

$\left[ \overline{S}_{k} \right]_{L} \leq \left[ \overline{S}_{m} \right]_{L - 1}\quad \text{for } \forall k, \forall m$

where the subscripts k and m indicate the number of clusters in step L and L−1 respectively, and $\overline{S}_{k}$ is the average intracluster similarity defined as:

$\overline{S}_{k} = 2\sum\limits_{i < j = 2}^{n}\frac{S_{ij}}{n\left( n - 1 \right)}$

where n is the number of individuals in the cluster.

The average intracluster similarity of cluster “k” is compared with the average intracluster similarity of the two parental k−1 clusters. If either of the parental k−1 clusters has an average intracluster similarity smaller than the average intracluster similarity of the newly formed k cluster, the number of optimal clusters for the data is found at k−1.
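The average intracluster similarity can be computed as in the Python sketch below, which then compares a merged cluster with its two parents. The comparison shown is one possible reading of the criterion, flagging a merge when the new cluster is less similar internally than both parents; how singleton clusters are handled is not stated in the text, so the function simply rejects them. The names and the example matrix are hypothetical.

```python
from itertools import combinations


def average_intracluster_similarity(S, members):
    """Mean pairwise similarity 2 * sum_{i<j} S_ij / (n * (n - 1)) within a cluster."""
    n = len(members)
    if n < 2:
        raise ValueError("the average is undefined for fewer than two members")
    total = sum(S[i][j] for i, j in combinations(members, 2))
    return 2.0 * total / (n * (n - 1))


if __name__ == "__main__":
    # Made-up similarity matrix with two tight groups, {0, 1} and {2, 3}.
    S = [
        [1.0, 0.9, 0.2, 0.1],
        [0.9, 1.0, 0.3, 0.2],
        [0.2, 0.3, 1.0, 0.8],
        [0.1, 0.2, 0.8, 1.0],
    ]
    parent_a, parent_b = [0, 1], [2, 3]
    s_a = average_intracluster_similarity(S, parent_a)
    s_b = average_intracluster_similarity(S, parent_b)
    s_new = average_intracluster_similarity(S, parent_a + parent_b)
    print(f"parents: {s_a:.3f}, {s_b:.3f}; merged: {s_new:.3f}")
    # Assumed reading: the merge lowers similarity relative to both parents.
    print("merge lowers intracluster similarity:", s_new < min(s_a, s_b))
```

In this example the merged cluster is markedly less similar internally than either parent, so the level with two clusters would be retained.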

Non-Hierarchical Clustering:

Once the optimal number of clusters is determined, a non-hierarchical clustering algorithm, which requires that the number of clusters in a population be defined prior to running the algorithm, is run on the data. The non-hierarchical cluster analysis partitions the set of individuals into the pre-determined number of clusters (as determined above) so as to optimize the intracluster sum of squares (e.g. by the K-means clustering method) of the previously generated pairwise similarities from the pairwise similarity matrix. The optimal intracluster sum of squares results when the global minimum is reached.
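A bare-bones K-means pass over the rows of a similarity matrix might look like the Python sketch below. It is an illustrative sketch only: the initial centroids are simply the first k rows (an assumption made for reproducibility), and the iteration shown converges to a local minimum of the intracluster sum of squares, so reaching the global minimum in practice usually calls for several restarts from different starting points. The function name and data are hypothetical.

```python
def k_means(rows, k, max_iter=100):
    """Assign each row to one of k clusters by minimising squared distances."""
    centroids = [list(rows[i]) for i in range(k)]  # naive initialisation (assumption)
    labels = None
    for _ in range(max_iter):
        new_labels = []
        for row in rows:
            distances = [
                sum((x - c) ** 2 for x, c in zip(row, centroid))
                for centroid in centroids
            ]
            new_labels.append(distances.index(min(distances)))
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for c in range(k):
            members = [rows[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    # Made-up similarity matrix with two tight groups, {0, 1} and {2, 3}.
    S = [
        [1.0, 0.9, 0.2, 0.1],
        [0.9, 1.0, 0.3, 0.2],
        [0.2, 0.3, 1.0, 0.8],
        [0.1, 0.2, 0.8, 1.0],
    ]
    print(k_means(S, k=2))
```

Here k is the value obtained from the optimal cluster number determination, not a number chosen by the analyst.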

Other classifiers that may be used for K-means clustering include, but are not limited to, nearest neighbor method (K-NNM), weighted K-NNM, and pseudo nearest neighbor combined classifier. Combinations of different classifiers can be made in order to construct more effective classification rules (Ed. S. K. Pal & P. P. Wang, Genetic Algorithms for Pattern Recognition, CRC Press 1996). In another alternative, a modal clustering method could be used in which not distance, but gradient, becomes of importance.

The result from non-hierarchical clustering is that the individuals in the population are optimally distributed into the optimal number of clusters.

Paired-Pair Analysis:

Additional factors to be resolved are the potential presence of multiple “types” of homozygotes, heterozygotes, or both within each cluster. The similarity measure only weights homozygosity or heterozygosity within a pair. As a result, in a comparison of two sets of paired individuals that have, for example, the same similarity score, one can have two sets of paired individuals that are homozygous or heterozygous at the same locus and two sets of paired individuals that are homozygous or heterozygous at two different loci. Similarly, one can have two sets of paired individuals that share the same homozygous or heterozygous allele, and two sets of paired individuals where the homozygous or heterozygous allele is different.

To resolve the issue of homozygosity or heterozygosity, a paired-pair analysis has been developed, where the homozygous or heterozygous loci of two pairs are compared pairwise to determine whether the pairs share the same homozygous or heterozygous alleles on the same loci:

$Z_{k,l} = \sum\limits_{i = 1}^{L}a_{i}$

-   where a_(i)=1 when two sets of pairs have the same homozygous or heterozygous alleles on the same loci, otherwise a_(i)=0;
-   where L denotes the total number of loci;
-   where k and l each represent different pairs of individuals in a particular cluster; and
-   where k=1, . . . , N, and l=1, . . . , N, where N is the number of individuals in a particular cluster.
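One way to read the paired-pair score is sketched below in Python for the homozygous case. The interpretation is an assumption: a locus counts (a_(i)=1) only when each pair shares a homozygous genotype at that locus and the shared allele is the same for both pairs. Genotypes are represented as pairs of allele symbols per locus; all names and data are hypothetical.

```python
def shared_homozygous_allele(g1, g2):
    """Return the allele when both genotypes are the same homozygote, else None."""
    if g1[0] == g1[1] and tuple(g1) == tuple(g2):
        return g1[0]
    return None


def paired_pair_score(genotypes, pair_k, pair_l):
    """Z_kl: number of loci at which both pairs share the same homozygous allele."""
    (p, q), (r, s) = pair_k, pair_l
    score = 0
    for locus in range(len(genotypes[p])):
        allele_k = shared_homozygous_allele(genotypes[p][locus], genotypes[q][locus])
        allele_l = shared_homozygous_allele(genotypes[r][locus], genotypes[s][locus])
        if allele_k is not None and allele_k == allele_l:
            score += 1
    return score


if __name__ == "__main__":
    # Four individuals typed at two loci (made-up genotypes).
    genotypes = {
        0: [("A", "A"), ("C", "C")],
        1: [("A", "A"), ("C", "C")],
        2: [("A", "A"), ("T", "T")],
        3: [("A", "A"), ("T", "T")],
    }
    print(paired_pair_score(genotypes, (0, 1), (2, 3)))
```

The two pairs agree at the first locus but are homozygous for different alleles at the second, a difference the pairwise similarity score alone would not reveal.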

Subsequently, the average score (sum of Z_(k,l)) of said paired-pair comparison for all pairs in a cluster is computed:

$W_{kl} = \sum\limits_{i > j = 1}^{N}Z_{ij}$

$\overline{Z} = \frac{\sum\limits_{i > j = 1\ (ij \neq kl)}^{N}W_{ij}}{M}$

where M is the number of permutations of paired-pairs, and W_(kl) is the sum of the paired-pair comparison of one pair versus all pairs in a cluster.

Subsequently, the sum of the comparison of one pair versus all pairs in a cluster is compared with the average sum for all pairs in order to assign a value to the homozygous or heterozygous relatedness of each member of a pair to all homozygotes or heterozygotes in a cluster:

$\begin{matrix} \text{if } W_{kl} \geq \overline{Z} \text{ then } \left\{ \begin{matrix} a_{i,lo} = 0 \\ a_{i,ob} = 1 \end{matrix} \right. \\ \text{if } W_{kl} < \overline{Z} \text{ then } \left\{ \begin{matrix} a_{i,lo} = 1 \\ a_{i,ob} = 0 \end{matrix} \right. \end{matrix}\quad \text{for each } k > l = 1, \ldots, N; \text{ and } i = k, l$

-   where a_(lo) indicates that the individual's score for W is “below” the average score for the cluster, and
-   where a_(ob) indicates that the individual's score for W is “above” the average score for the cluster.

Next, the total number of “a”s for each individual is computed:

$b_{k,lo} = \sum\limits_{k > l = 1}^{N}a_{k,lo}$

$b_{k,ob} = \sum\limits_{k > l = 1}^{N}a_{k,ob}$

for each individual k in a particular cluster. This allows a comparison of the number of times the sum for a pair is greater than or equal to the average sum of all pairs with the number of times the sum for a pair is less than the average sum for all pairs, for each individual in a given cluster.

The decision if and how to split a particular cluster is based on the “b” score for each individual. One cluster is constructed for individuals for which the following is satisfied:

$b_{k,lo} \geq b_{k,ob}$

Individuals for which this is not satisfied are assigned to another cluster. Thus, a cluster is divided into a first cluster and a second cluster if there is:

    -   a first group of members of a cluster where the number of times the sum of one pair is less than the average sum of pairs is greater than or equal to the number of times the sum for one pair is greater than or equal to the average sum for all pairs, and
    -   a second group of members of a cluster wherein the number of times the sum of one pair is less than the average sum of pairs is less than the number of times the sum for one pair is greater than or equal to the average sum for all pairs, and
    -   where the first group of members are placed into the first cluster and the second group of members are placed into the second cluster.
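The splitting rule can be sketched in Python as follows. The sketch assumes that the per-pair sums W_(kl) have already been computed and are supplied as a dictionary keyed by pairs of individuals, and it takes the average as the plain mean over all pairs in the cluster, which is a simplifying assumption about M. The function name and the example values are hypothetical.

```python
def split_cluster(members, W):
    """Split a cluster using the per-individual 'b' scores.

    W maps each pair (k, l) of members to its summed paired-pair score W_kl.
    Returns (first_cluster, second_cluster).
    """
    z_bar = sum(W.values()) / len(W)  # assumed: simple mean over all pairs
    b_lo = {m: 0 for m in members}    # times a member's pair fell below the average
    b_ob = {m: 0 for m in members}    # times a member's pair was at or above it
    for (k, l), w_kl in W.items():
        counts = b_ob if w_kl >= z_bar else b_lo
        counts[k] += 1
        counts[l] += 1
    first = [m for m in members if b_lo[m] >= b_ob[m]]
    second = [m for m in members if b_lo[m] < b_ob[m]]
    return first, second


if __name__ == "__main__":
    # Made-up W values: members 0, 1 and 2 are closely related, member 3 is not.
    W = {
        (0, 1): 6, (0, 2): 5, (1, 2): 6,
        (0, 3): 1, (1, 3): 1, (2, 3): 2,
    }
    first, second = split_cluster([0, 1, 2, 3], W)
    print("first cluster:", first)
    print("second cluster:", second)
```

If every member satisfies the condition, the second list is empty and the cluster is left undivided.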

Values can be assigned to the homozygous or heterozygous relatedness of each member of a pair to all homozygotes or heterozygotes in said cluster based on whether the sum for one pair is greater than or equal to the average sum of all pairs, or whether the sum for one pair is less than the average sum for all pairs. The number of times the sum for one pair is greater than or equal to the average sum of all pairs is then compared with the number of times the sum for one pair is less than the average sum for all pairs, for each individual in said cluster. These methods are given and described in more detail in the Detailed Description of the Invention. Equations that are superficially different but that achieve the same result are considered to be part of the claimed invention. In all cases where the paired-pair analysis is discussed, equations and analysis of homozygosity apply equally as equations and analysis of heterozygosity.

3. Implementation

Implementation of the methods of the invention in software and on computer systems is described in detail herein in the Implementation of Methods of the Invention section.

PREFERRED EMBODIMENTS OF THE INVENTION

I. Methods of Clustering Members of a Sample

The invention features methods of clustering members of a sample. Although the examples and the preferred embodiments are focussed towards genetic analysis, these methods can be used on any data needing clustering. Therefore “members” is used herein to mean the objects or individuals (or individual objects) that make up a sample. A “sample” encompasses anything made up of discrete objects that can be grouped in some way. This could be as simple as a group of wooden blocks or as complex as a population of people.

Advantages of this preferred embodiment include weighting rare traits as part of the similarity calculation. This allows a more accurate determination of whether the sample is homogenous or heterogenous for the traits studied. It also permits members of great interest to be identified as a result of the rareness of a trait. Other advantages include the clustering of members of a sample without having to rely on a subjective analysis of the number of groups present by determining the “optimal number of clusters” resulting from a hierarchical clustering algorithm, and a paired-pair analysis that permits the determination of the similarity of pairs of homozygous pairs. “Homozygous” pairs can be used to refer to the sharing of the same allele genetically, or to the sharing of the same value for any trait under study.

Either as part of the method, or in preparation to perform the method, the traits to be studied must be identified. The identification of the traits is dictated by the questions asked and will vary depending on the sample. For example, to determine the overall similarity of a population, dispersed traits might be assessed; to determine whether a particular group is present in a population, traits relating to that group in particular would be assessed. The genotypes of particular loci are an example of traits. The specific loci chosen to be used in the cluster analysis for any given population will depend on the population and the question asked. For the Examples using the methods of the invention, SNPs have been used. Use of these values for traits is advantageous over RFLPs, etc., but these other forms of genetic information can also be used.

Values also have to be assigned to different aspects of a given trait. By “value” is meant a numerical representation of the desirability of a certain aspect of a trait being shared between two members of a sample. Again, this can either be done as part of the methods of the invention, or as a prior step, since the values given will relate primarily to the design of the study. However, in general, values are assigned based to some extent on the frequency or desirability of a particular aspect of a given trait. In the example provided, values are assigned based on the alleles at genetic loci. Therefore, when a pair of members share no common allele, this is valued as 0; when a pair of members share a common allele this is valued as 1; and when a pair of members share two common alleles this is valued as 2. Part of the judgement is that the study is designed to look for similarity.

In a related aspect, weights are assigned for some values based on the frequency of selected trait values in said sample. By “weights” is meant a value that can take the form of a number or a function. Weights are generally applied to trait values shared between two members of the population that are rare or are desirable (in the sense of wanting to be known/identified) or both. For example, weights can be assigned for sharing rare alleles between a pair of members, and for sharing a homozygous genotype between a pair of members. Again, the shared trait values to be weighted will differ for each study. In the present case, since similarity between pairs is desired, sharing of a homozygous genotype (sharing two of the same allele at a particular locus, e.g. “AA”) was given a weight of “30”. Rare alleles were also given a value of “30”. Rare alleles would be those that are the less frequently observed allele of the SNP; this is determined based on empirical data. Other trait values can be weighted in similar manner according to the study design. In addition, weight functions can be drafted, in this case allowing variation of the weighting depending on the size of the population, for example.

An ordered set of similarity data is formed by the methods of the invention by determining the similarity between pairs of members of said sample using said trait values and said weights. The ordered set of similarity data can be present in a similarity matrix, but so long as the data is ordered in such a way as to permit its use in clustering algorithms it need not be in a similarity matrix. One of ordinary skill in the art would be able to determine other acceptable forms to allow the methods of the invention. The similarity data set or similarity matrix can be formed based on the pairwise similarity measure:

$S_{ij} = \sum\limits_{k = 1}^{L}\frac{a_{k,ij} \cdot w_{r,k,ij} \cdot w_{h,k,ij}}{2L}$

-   where i and j denote the particular members;
-   where k denotes a particular locus;
-   where L denotes the total number of loci;
-   where
    -   a_(k,ij) equals 0 for locus k when members i and j share no common allele,
    -   a_(k,ij) equals 1 for locus k when members i and j have one common allele, and
    -   a_(k,ij) equals 2 for locus k when members i and j have two common alleles;
-   where w_(r,k,ij) denotes a weight function for sharing rare alleles; and
-   where w_(h,k,ij) denotes a weight function for sharing a homozygous genotype.

Other pairwise similarity measures can also be used so long as they permit (or can be modified to permit) the incorporation of the weight functions.

An ordered set of distance data can be calculated from the ordered set of similarity data. Alternatively, the ordered set of distance data can be calculated from the similarity matrix. Alternatively, the distance data is present in a distance matrix. The distance matrix can be generated using distances selected from the group consisting of Euclidean squared distance, Euclidean distance, absolute distance, and correlation coefficient. Most preferably, the distance matrix is generated using Euclidean squared distance. However, one with skill in the art would be able to determine other distance measurements that can be used for this comparison. Any distance measurement that can be used in hierarchical clustering is specifically contemplated for use. An exemplary distance algorithm is given in the Detailed Description of the Invention.

A hierarchical clustering algorithm is then applied to the ordered distance data or distance matrix. The hierarchical clustering algorithm produces a dendrogram based on calculations selected from the group consisting of minimum variance, nearest neighbor, furthest neighbor, group average, and median computations. Preferably, the hierarchical clustering algorithm produces a dendrogram based on minimum variance calculations. However, one with skill in the art would be able to determine other bases for the hierarchical clustering algorithm. Any basis that permits hierarchical clustering is specifically contemplated for use. An exemplary hierarchical clustering algorithm is given in the Detailed Description of the Invention.

The optimal number of clusters can be determined based on the results of the hierarchical clustering algorithm. By “optimal number of clusters” is meant the number of distinct groups in a sample where to add one more cluster would mean that the overall level of similarity within each cluster would decrease rather than increase. The optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster. An algorithm representing this is provided in the Detailed Description of the Invention. Other algorithms that achieve the same result by different means are considered to also be part of the claimed invention.

A non-hierarchical clustering algorithm can be applied to the ordered set of similarity data (or the similarity matrix) using the optimal number of clusters. Any non-hierarchical clustering algorithm known in the art can be used, although preferably the non-hierarchical clustering algorithm uses K-means clustering. Preferably, the K-means clustering uses classifiers selected from the group consisting of sum of squares, nearest neighbor method, weighted nearest neighbor method, and pseudo nearest neighbor combined classifier, although others in the art are also specifically envisioned to be used. Most preferably, said classifier is sum of squares.

The relatedness between pairs of homozygous pairs can be determined by performing a paired-pair analysis on the clusters resulting from the non-hierarchical clustering algorithm, where the homozygous loci of two pairs are compared pairwise to determine whether the pairs share the same homozygous alleles on the same loci. Equations detailing this and the following steps are provided in the Detailed Description of the Invention. Equations that are different but that achieve the same result are considered to be part of the claimed invention. In addition, alternative computer embodiments are described under Implementation of the Methods of the Invention, herein.

A cluster is then divided into a first cluster and a second cluster if there is: a first group of members of said cluster wherein said number of times said sum of one pair is less than said average sum of pairs is greater than or equal to said number of times said sum for one pair is greater than or equal to said average sum for all pairs, and a second group of members of said cluster wherein said number of times said sum of one pair is less than said average sum of pairs is less than said number of times said sum for one pair is greater than or equal to said average sum for all pairs, and wherein said first group of members are placed into said first cluster and said second group of members are placed into said second cluster. This step can optionally be repeated if there is a question whether the resulting cluster(s) contains more than one group of homozygous members.

II. Methods of Non-Hierarchical Clustering

The invention also features methods of non-hierarchical clustering of members of a sample that include using paired-pair analysis to distinguish among the various homozygous forms that may be present in a sample, and to determine whether in fact the clusters should be further separated. This step can be repeated as necessary.

A non-hierarchical clustering algorithm is applied to a sample, either as part of the method, or as a prior step; alternatively a sample that is thought to be a population, but that has homozygous members, can be assessed by this method, whether or not a non-hierarchical clustering method is performed on it. Thus, this method can be used on any population to assess the presence of more than one group of homozygous members.

The relatedness between pairs of homozygous pairs is determined by performing a paired-pair analysis on the clusters resulting from said non-hierarchical clustering algorithm using the same methods that are described in the first Preferred Embodiment of the Invention and the equations discussed in the Detailed Description of the Invention. As stated previously, the equations set forth herein and used to achieve the desired results are exemplary and do not limit the invention to achieving the described results through the use of these equations. Other equations that achieve the same results are expressly intended to form part of the invention.

III. Methods of Detecting the Presence of Clusters in a Sample

The invention also features methods of determining the presence of clusters in a sample. This method can be used to simply determine whether for given traits a sample is homogenous. Similarly, it can be used to determine whether an individual, individuals, small group, or other population, is part of the population or whether it forms a separate group.

As described in the first Preferred Embodiment of the invention, traits are chosen, values are given, and weights are assigned based on factors determined to be relevant to the population studied and to the question asked. Similarly, an ordered set of similarity data or similarity matrix is obtained as described previously in the first Preferred Embodiment of the invention and in the Detailed Description of the Invention.

In some cases, this method further comprises calculating an ordered set of distance data or a distance matrix from said ordered set of similarity data or similarity matrix, again as described previously in the first Preferred Embodiment of the invention and in the Detailed Description of the Invention. This is useful when a hierarchical clustering algorithm is to be applied to the data.
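For illustration, one common way to derive a distance matrix from a similarity matrix is sketched below. The conversion equation used by the invention is given in the Detailed Description and is not reproduced in this section, so the relation used here (distance = 1 − similarity, with similarities assumed to lie between 0 and 1) is an assumption, and the function name `similarity_to_distance` is hypothetical.

```python
def similarity_to_distance(similarity):
    """Convert a similarity matrix to a distance matrix.

    Assumption: similarities lie in [0, 1] and distance is taken as
    1 - similarity. The equation actually used by the method may differ.
    """
    return [[1.0 - s for s in row] for row in similarity]


# Toy 3 x 3 similarity matrix (symmetric, with 1.0 on the diagonal).
similarity = [
    [1.0, 0.8, 0.3],
    [0.8, 1.0, 0.4],
    [0.3, 0.4, 1.0],
]
distance = similarity_to_distance(similarity)
for row in distance:
    print([round(d, 2) for d in row])
```

Under this assumption, identical members are at distance 0 and the most similar pair is the closest pair, which is the form a hierarchical clustering routine typically expects as input.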

A clustering algorithm, either hierarchical or non-hierarchical, is then applied to the ordered set of similarity data or the similarity matrix (or the ordered set of distance data or distance matrix). Hierarchical and non-hierarchical clustering algorithms have been described previously in the first Preferred Embodiment of the invention and in the Detailed Description of the Invention. If a hierarchical clustering algorithm is used, the method of determining the optimal number of clusters in a sample is preferentially used as described previously in the first Preferred Embodiment of the invention and in the Detailed Description of the Invention. The method is also described further in the fourth Preferred Embodiment of the Invention, below.

IV. Methods of Determining the Number of Clusters in a Sample

The invention also features methods of determining the number of clusters in a sample based on the results of a hierarchical clustering algorithm. Since a hierarchical clustering algorithm results in a tree with one end being the entire sample and the other end being where each member of a sample is in its own cluster, another method is necessary to determine at which point the optimal number of clusters is present. The method of the invention can be used on any set of data following its analysis by hierarchical clustering. The method of optimal clustering has been described previously as part of the first Preferred Embodiment of the invention and in the Detailed Description of the Invention.

V. Implementation of Methods of the Invention

The automated system for clustering can be implemented through a variety of combinations of computer hardware and software. In one implementation, the computer hardware is a high-speed multiprocessor computer running a well-known operating system, such as UNIX. The computer should preferably be able to perform millions, tens of millions, billions or more floating-point operations per second. This speed is advantageous for determining the homogeneity and/or heterogeneity of a population or sub-population for even the largest population sizes within a reasonable period of time. Such computers are manufactured by companies such as Intel; International Business Machines; Silicon Graphics, Incorporated; Hitachi; and Cray, Incorporated.

While it is envisioned that currently available personal computers using single or multiple microprocessors might also function within the parameters of the present invention, especially for small populations, such a computer system might be too slow to perform all the combinations of the paired-pair analysis required for large populations. However, as the efficiency and speed of microprocessor-based computer systems increase, the likelihood that a conventional personal computer would be useful for the present invention also increases.

Preferably, the software that runs the calculations for the present invention is written in a language that is designed to run within the UNIX operating system. The software language can be, for example, FORTRAN, C, C++, PERL, Python, Pascal, Cobol, Java, any other computer language known, or combinations of languages. It should be noted that the nucleic acid sequence data will be stored in a database and accessed by the software of the present invention. These programming languages are commercially available from a variety of companies such as Microsoft, Compaq, and Borland International.

In addition, the software described herein can be stored on several different types of media. For example, the software can be stored on floppy disks, hard disks, CD-ROMs, Electrically Erasable Programmable Read-Only Memory, Random Access Memory, or any other type of programmed storage media.

Referring to FIG. 18, a system 10 that includes a data storage 20, such as that described above, is linked to a memory 25. Associated with the memory 25 is an analysis module 28 that provides commands and instructions for the data analysis functions described below. Communicating with the memory 25 is a processor 30 that is used to process the information being analyzed within the module 28. Conventional processors, such as those made by Intel, Compaq, and Motorola are anticipated to function within the scope of the present invention. As illustrated, an input 35 provides data to the system 10. The input 35 can be a keyboard, mouse, data link, or any other mechanism known in the art for providing data to a computer system. In addition, a display 38 is provided to display the output of the analysis undertaken by the analysis module 28.

Referring now to FIG. 19, an alternative system 50 includes a data storage 55, such as that described above, that is shared among sub-systems 60 a, 60 b, and 60 c and is linked to memories 65 a, 65 b, and 65 c in each sub-system. Associated with each of the memories 65 a, 65 b, and 65 c are analysis modules 68 a, 68 b, and 68 c that store commands and instructions for performing the data analysis functions described below. Communicating with each of the memories 65 a, 65 b, and 65 c are processors 70 a, 70 b, and 70 c that are used to process the information being analyzed within the analysis modules 68 a, 68 b, and 68 c. Note that although three sub-systems are depicted in FIG. 19, their number can be increased or decreased as needed. Conventional processors, such as those made by Intel, Compaq, and Motorola are anticipated to function within the scope of the present invention. As illustrated, an input 75 provides data to the system 50. The input 75 can be a keyboard, mouse, data link, or any other mechanism known in the art for providing data to a computer system. In addition, a display 78 is provided to display the output of the analysis undertaken by the analysis modules 68 a, 68 b, and 68 c for the sub-systems 60 a, 60 b, and 60 c.

The sub-systems 60 a, 60 b, and 60 c may be linked to the data storage 55 and to each other in a variety of ways including, but not limited to, data links or network links. Various software products may be used to facilitate the linking between sub-systems 60 a, 60 b, and 60 c such as the publicly available Parallel Virtual Machine (PVM) and the Message Passing Interface (MPI). PVM and MPI may be obtained from http://www.epm.ornl.gov/pvm/ and http://www-unix.mcs.anl.gov/mpi/index.html, respectively.

Referring to FIG. 20, a process 100 of genetic cluster analysis of a population is illustrated. The process 100 begins at start state 102, and then moves to a process state 104 wherein a similarity matrix is constructed. The process 100 then moves to a process state 106 wherein a distance matrix is constructed. The process 100 then moves to process state 108 wherein hierarchical clustering is performed. The process 100 then moves to process state 110 wherein the optimal number of clusters is determined based on the result of the process state 108. The process 110 of determining the optimal number of clusters is described in more detail in FIG. 21.

The process 100 then moves on to process state 112 wherein a non-hierarchical K-means clustering is used to distribute members of the population into the number of clusters identified in state 110. The process 100 then moves on to process state 114, wherein paired-pair analysis is used to determine if any of the clusters need to be divided again. Process 114 is described in more detail in FIG. 22 and in an alternative form in FIG. 23. The process 100 then moves to a state 116 wherein the result is output to a display. The process 100 then terminates at an end state 120.

Referring now to FIG. 21, the process 110 of determining the optimal number of clusters in a population is illustrated. The process 110 begins at a start state 200 and then moves to a state 204 where the results of a hierarchical clustering process are stored as a dendrogram. Process 110 then moves on to a process state 206 where the average pairwise intracluster similarity of a new cluster is compared with the average pairwise intracluster similarity of its parental clusters. A determination is then made at a decision state 208 whether the average intracluster pairwise similarity of the new cluster is greater than the average intracluster pairwise similarity of either parental cluster. If the average intracluster pairwise similarity of the new cluster is not greater than either parental cluster, the process 110 returns to the state 206 to compare the average intracluster pairwise similarity of the next new cluster with its parental clusters.

However, if the average intracluster pairwise similarity of the new cluster is greater than either parental cluster, the process 110 moves to a state 210 to identify the optimal number of clusters at the parental level. The process 110 then moves to a state 212 where the result is output to a display. The process 110 then terminates at an end state 214.
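The quantity compared at states 206 and 208 can be sketched as follows. This is a minimal illustration, not the implementation of FIG. 21: it only computes the average pairwise intracluster similarity of a merged ("new") cluster and of its two parental clusters, and leaves the traversal of the dendrogram and the stopping decision to the caller. The function names and the toy similarity matrix are hypothetical, and returning `None` for a single-member cluster (which has no pairs) is an assumption.

```python
from itertools import combinations


def avg_intracluster_similarity(similarity, members):
    """Mean similarity over all unordered pairs of members in one cluster."""
    pairs = list(combinations(members, 2))
    if not pairs:
        return None  # assumption: undefined for a single-member cluster
    return sum(similarity[i][j] for i, j in pairs) / len(pairs)


def compare_merge(similarity, parent_a, parent_b):
    """Return the averages for the new (merged) cluster and both parents."""
    new_cluster = list(parent_a) + list(parent_b)
    return (
        avg_intracluster_similarity(similarity, new_cluster),
        avg_intracluster_similarity(similarity, parent_a),
        avg_intracluster_similarity(similarity, parent_b),
    )


# Toy matrix: members 0 and 1 are similar, members 2 and 3 are similar,
# and similarity across the two groups is low.
similarity = [
    [1.0, 0.9, 0.2, 0.2],
    [0.9, 1.0, 0.2, 0.2],
    [0.2, 0.2, 1.0, 0.8],
    [0.2, 0.2, 0.8, 1.0],
]
new_avg, a_avg, b_avg = compare_merge(similarity, [0, 1], [2, 3])
print(new_avg, a_avg, b_avg)
```

In this toy case both parental clusters have a higher average similarity than the merged cluster, so merging them would combine two internally homogeneous groups into a less homogeneous one.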

Referring now to FIG. 22, one embodiment of a process 114 of paired-pair analysis is illustrated. The process 114 begins at start state 300 and then moves to a process state 302 wherein the common homozygous loci between all pairs in the similarity matrix generated at the state 104 are identified. Process 114 then moves to process state 304 wherein a paired-pair comparison is performed, such that the number of common homozygous loci between each set of pairs is determined. Process 114 then moves to process state 306 wherein the average paired-pair score for all pairs is calculated. A determination is then made at a decision state 308 for each set of pairs of whether the number of homozygous loci between a set of pairs is greater than the average score for all pairs. If the Number is greater than the Average, process 114 moves to state 310 wherein a counter X is incremented for each member of the pair. If the Number is less than the Average, process 114 moves to state 312 wherein a counter Y is incremented for each member of the pair. From state 310 or state 312, process 114 then moves to decision state 314, wherein a determination is made for each member of the cluster as to whether counter X is greater than counter Y. If X is greater than Y, process 114 moves to state 316 wherein the member is assigned to one cluster. If X is less than Y, process 114 moves to state 318 wherein the member is assigned to a second cluster. From either state 316 or 318, process 114 moves to a state 320 wherein the result is output to a display. Process 114 then moves to a decision state 322 wherein a determination is made whether all clusters have been analyzed. If all clusters have not been analyzed, the process 114 returns to state 302 to identify common homozygous loci between all pairs in a new cluster. If all clusters have been analyzed, process 114 then terminates at an end state 324.
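States 302 through 306 can be illustrated with the following sketch, which identifies the common homozygous loci for each pair of members, scores each set of two pairs by the number of homozygous loci and alleles they share, and averages the scores. It is a sketch under stated assumptions rather than the equations of the Detailed Description: genotypes are assumed to be two-character strings such as "AA" or "AG", missing genotypes are not handled, and all function names and the toy data are hypothetical. The counter and assignment steps (states 308 through 318) are deliberately not sketched.

```python
from itertools import combinations


def common_homozygous(genotypes_1, genotypes_2):
    """Set of (locus index, allele) where both members are homozygous
    for the same allele."""
    return {
        (locus, g1[0])
        for locus, (g1, g2) in enumerate(zip(genotypes_1, genotypes_2))
        if g1[0] == g1[1] and g1 == g2
    }


def paired_pair_scores(genotypes):
    """Score every set of two pairs by the number of shared homozygous loci."""
    pair_sets = {
        pair: common_homozygous(genotypes[pair[0]], genotypes[pair[1]])
        for pair in combinations(sorted(genotypes), 2)
    }
    return {
        (p, q): len(pair_sets[p] & pair_sets[q])
        for p, q in combinations(sorted(pair_sets), 2)
    }


# Toy cluster: four members genotyped at three loci.
cluster = {
    "m1": ["AA", "CC", "AG"],
    "m2": ["AA", "CC", "GG"],
    "m3": ["AA", "CT", "GG"],
    "m4": ["AA", "CC", "GG"],
}
scores = paired_pair_scores(cluster)
average_score = sum(scores.values()) / len(scores)
print(len(scores), average_score)
```

With four members there are six pairs and fifteen sets of two pairs. Each score would then be compared with `average_score` at decision state 308.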

Due to memory and clock-speed constraints imposed by the computational hardware, the paired-pair analysis process described above and diagrammed in FIG. 22 may be performed over multiple computers or processors. Paired-pair analysis may be cumbersome to perform depending on the size of the population, since the number of unique pairs is determined by (N²−N)/2. As a result, the number of paired-pair comparisons increases dramatically with increasing numbers of individuals in the population. For illustration purposes, if the number of members N is equal to 100, the number of unique pairs would be 4950. Then for the paired-pair analysis, where homozygous loci of pairs of individuals within a given cluster are to be compared pairwise, 12,248,775 ((4950²−4950)/2) comparisons would be required to be performed. To increase the speed of the calculations as well as to remove hardware constraints such as memory limitations imposed by a single computer or processor, a preferred embodiment would consist of, but not be limited to, performing the comparisons over many computers or processors, optionally at least 5.
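The arithmetic above can be checked with a few lines of code, together with one simple way of dividing the comparisons among processors. The pair-count formula is the one given in the text; the even split into contiguous index ranges is only an assumed scheme, and the function names `unique_pairs` and `split_work` are hypothetical.

```python
def unique_pairs(n):
    """Number of unordered pairs among n items: (n^2 - n) / 2."""
    return (n * n - n) // 2


def split_work(total, workers):
    """Divide comparison indices 0..total-1 into near-equal contiguous
    (start, end) ranges, one per worker (an assumed partitioning scheme)."""
    base, extra = divmod(total, workers)
    bounds, start = [], 0
    for w in range(workers):
        size = base + (1 if w < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


members = 100
pairs = unique_pairs(members)          # pairs of individuals
comparisons = unique_pairs(pairs)      # pairs of pairs
print(pairs, comparisons)              # 4950 12248775

for worker, (start, end) in enumerate(split_work(comparisons, 5)):
    print(worker, start, end)
```

Because the number of comparisons grows roughly with the fourth power of the number of members, doubling a cluster's size multiplies the work by about sixteen, which is the motivation for spreading it over several processors.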

Throughout this application, various publications, patents, and published patent applications are cited. The disclosures of the publications, patents, and published patent specifications referenced in this application are hereby incorporated by reference into the present disclosure to more fully describe the state of the art to which this invention pertains.

EXAMPLES

Several of the methods of the present invention are described in the following examples, which are offered by way of illustration and not by way of limitation. Many other modifications and variations of the invention as herein set forth can be made without departing from the spirit and scope thereof.

Example 1 Use of the Genetic Clustering Algorithm to Assess the Ethnic Diversity of a Population

Sample Population:

Individuals were recruited from the greater San Diego metropolitan area. Respondents were categorized into 5 major ethnic groups based upon self-reporting of the ethnic origin of both parents and all grandparents. Five ethnic classifications were chosen: African American, US Caucasian, Chinese, Japanese and Hispanic/Latino. 95 DNA samples were selected from each classification and included only samples with grandparents reported from the same classification.

All participating individuals signed informed consent agreements in this IRB approved study. Individuals were given coded identification and names were not included in the database. DNA was isolated from whole blood using standard techniques. Samples were stored frozen with bar coded labels.

Sixty-five SNPs were genotyped in each sample. Thirty-one SNPs were from a 450 kb region of chromosome 8, 19 were from an 80 kb region of chromosome 1, and 15 were unlinked SNPs randomly distributed throughout the genome. These latter SNPs were expected to be significantly polymorphic (heterozygosity > 0.2) based upon a pilot SNP survey of French Caucasians in which pooled DNA samples were sequenced. SNPs used in this study, as well as the microsequencing primers, are detailed in the Sequence Listing.

SNP Detection and Mapping:

BAC libraries were obtained as described by Woo et al. (Nucleic Acids Res. 22:4922-4931, 1994). Briefly, three different human genomic libraries were produced by cloning partially digested DNA (BamHI, HindIII or NdeI partial digests) from a human lymphoblastoid cell line (derived from individual N°8445 from CEPH) into pBeloBAC11 (Kim et al., Genomics 34:213-218, 1996). The combined BAC library contains clones with an average insert size of 150 kb corresponding to 12 human genome equivalents. BAC clones covering the genomic region under study (around markers D8S277 and D1S2785) were obtained by screening the BAC libraries with public STSs WI-14718, WI-3831, D8S1413E, WI-8327, WI-3823 and ND4, using three-dimensional pooling (Chumakov, et al., Nature 377:175-297, 1995). Subchromosomal localization of the BACs selected by STS screening was verified by fluorescence in situ hybridization (FISH), performed on metaphasic chromosomes (Cherif et al., Proc Natl Acad Sci USA 87:6639-6643, 1990). BAC insert size was determined by Pulsed Field Gel Electrophoresis after digestion with NotI. The BACs selected by STS screening and verified by FISH were assembled into contigs, and new markers were generated by partial sequencing of insert ends from some of them.

BACs were individually subcloned in pBluescript II SK(+) vector following standard procedures, and sequences were obtained by sequencing these plasmid subclones. An average of 500 bases at each end of the subclone was determined by fluorescent automated sequencing on ABI 377 sequencers (Perkin Elmer), using a dye-primer cycle sequencing protocol and ThermoSequenase (Amersham Life Science).

The sequence fragments from the BAC subclones were assembled using Gap4 software from R. Staden (Bonfield, et al., Nucleic Acids Res. 23:4992-4999, 1995). Where required, directed sequencing techniques (primer walking) were used to complete sequences and link contigs. Unique contigs representing the consensus sequence were finally reconstructed from the alignment of different fragments. The consensus sequence was first analyzed using automated software that eliminates repeat sequences.

To identify common SNPs, primer pairs were derived from consensus sequence using the OSP software (Hillier and Green, PCR Methods Appl. 1:124-8, 1991). Amplicons investigated covered both potential exons and random genomic regions. The PCR primers were then used to amplify the corresponding genomic sequences in a pool of DNA from 100 unrelated individuals (blood donors of French origin). PCR reactions (25 μL total volume) contained 2 ng/μL pooled DNA, 2 mM MgCl₂, 200 μM each dNTP, 2.9 ng/μL each primer, 0.05 unit/μL Ampli Taq Gold DNA polymerase (Perkin Elmer) and 1×PCR buffer (10 mM Tris HCl pH 8.3, 50 mM KCl). Amplification reactions were performed in a PTC200 MJ Research thermocycler, with initial denaturation at 95° C. for 10 min followed by 40 cycles of denaturation at 95° C. for 30 sec, annealing at 54° C. for 1 min, and extension at 72° C. for 30 sec. After cycling, a final elongation step was performed at 72° C. for 10 min. Amplification products from pooled DNA samples were sequenced on both strands by fluorescent automated sequencing on ABI 377 sequencers (Perkin Elmer), using a dye-primer cycle sequencing protocol and ThermoSequenase (Amersham Life Science). Following gel image analysis and DNA sequence extraction with ABI Prism DNA Sequencing Analysis software, sequence data were automatically processed with AnaPolys (Genset), software designed to detect the presence of SNPs among pooled amplified fragments. The polymorphism search is based on the presence of superimposed peaks in the electrophoresis pattern from both strands, resulting from two bases occurring at the same position. The detection limit for the frequency of SNPs detected by sequencing pools of 100 individuals is about 10% for the minor allele, as verified by sequencing pools of known allelic frequencies. However, more than 90% of the SNPs detected by the pooling method have a frequency for the minor allele higher than 20%.

Genotyping:

Heterozygote scoring for the six combinations of A, C, G, T was reported by our software as AC, AG, AT, CG, CT or GT. Therefore, we designated the first allele as allele1 and the second as allele2 for all disequilibrium calculations. Genotyping of individual DNA samples was performed using a microsequencing procedure as follows. Amplification products containing the SNPs were obtained by performing PCR reactions similar to those described for SNP identification. After purification of the amplification products, the microsequencing reaction mixture was prepared by adding in a 20 μL final volume: 10 pmol microsequencing primer (which hybridizes just upstream of the polymorphic base), 1 U of Thermosequenase (Amersham) or TaqFS (Perkin Elmer), 1.25 μL Thermosequenase buffer (260 mM Tris HCl pH 9.5, 65 mM MgCl₂) or 2.5 μL 5×CSA Sequencing Buffer (Perkin Elmer), and the two appropriate fluorescent ddNTPs (Perkin Elmer, Dye Terminator Set) complementary to the nucleotides at the polymorphic site of each SNP tested. After 4 minutes at 94° C., 20 microsequencing cycles of 15 sec at 55° C., 5 sec at 72° C., and 10 sec at 94° C. were carried out in a GeneAmp PCR System 9700 (PE Applied Biosystems). After reaction, the 3′-extended primers were precipitated to remove the unincorporated fluorescent ddNTPs and analyzed by electrophoresis on ABI 377 sequencers. Following gel analysis with GENESCAN software (Perkin Elmer), data were automatically processed with AnaMIS (Genset), software that allows the determination of the alleles of SNPs present in each amplified fragment based on fluorescent intensity ratios.

SNP Data Analysis:

Genotype data was compiled and checked for scoring accuracy with 32 duplicated samples. Sporadic PCR failures were tolerated as long as the number of genotypes for any one SNP in any one ethnic group did not fall below 88. The SNPs used for this study are provided in the Sequence Listing.

Genetic Cluster Analysis of the Total Population:

Cluster analysis, using the genetic clustering algorithm of the invention (Detailed Description of the Invention), was performed on the total population of 475 individuals (all ethnic groups combined). Cluster analysis was performed using three different sets of SNP genotype results: 1) using 15 loci randomly distributed throughout the genome, 2) using 50 loci distributed over chromosomes 1 and 8, and 3) using all 65 loci together (FIG. 1). If the genetic diversity of the individuals matched their claimed ethnic diversity, we should have 5 clusters for each set of loci. In contrast, the results show that when 65 loci are used, 9 clusters are found, and when 15 or 50 loci are used, six clusters are formed. In fact, the number of groups found is relatively SNP-dependent, since different numbers of clusters are found with different sets of SNPs and the composition of the clusters is different for the different sets of SNPs. This indicates the importance of the choice of loci to use for genetic clustering analysis.

Subsequently, the individuals in each cluster for each analysis (with15, 50 or 65 loci) were identified based on their reported ethnicidentity (FIGS. 2-8). Again, if the genetic diversity of the individualsfor these loci is related to their ethnic diversity, we should seepredominantly one ethnic group in each cluster. However, members of thedifferent ethnic groups are found across all clusters (although indiffering ratios) based on results from any of the 3 groups of loci withfour exceptions: 1) no Chinese individuals were found in Cluster 2 when65 loci were used for analysis (FIG. 3), 2) no African Americans werefound in Cluster 4 when 15 loci were used for analysis (FIG. 5), 3) noUS Caucasians were found in Cluster 5 when 15 loci were used foranalysis (FIG. 6), and 4) no African Americans were found in Cluster 7when 65 loci were used for analysis (FIG. 8).

An examination of the clusters resulting from the analyses shows that tosome extent the African American group can be separated from the otherethnic groups. The analysis using 15 loci shows that there are: twoclusters, Cluster 5 and Cluster 6 (FIGS. 6 and 7, respectively), thatcontain predominantly African American individuals; three clusters,Cluster 1, Cluster 2, and Cluster 3 (FIGS. 2, 3, and 4, respectively),that contain predominantly non-African American individuals; and onecluster, Cluster 4 (FIG. 5), that contains about equal numbers ofindividuals from all the ethnic groups. The analysis using 50 loci showsthat there are: one cluster, Cluster 1 (FIG. 2), that containspredominantly African American individuals; and three clusters, Cluster3, Cluster 5, and Cluster 6 (FIGS. 4, 6, and 7, respectively), thatcontain predominantly non-African American individuals. When all 65 lociare used, the separation among the clusters is less clear, although twoclusters, Cluster 8 and Cluster 9 (FIG. 8), and to some extent Cluster 1(FIG. 2), show predominantly African American individuals, while twoclusters, Cluster 3 and Cluster 4 (FIGS. 4 and 5, respectively), showpredominantly non-African Americans.

Thus, this analysis indicates that it is likely that loci can be identified that would be able to differentiate between at least African American and other ethnic groups. Potentially, using different loci or subgroups of loci, it may be possible to also differentiate among the other ethnic groups.

Genetic Cluster Analysis of each Ethnic Population:

Cluster analysis, using the genetic clustering algorithm of the invention (Detailed Description of the Invention), was performed on each ethnic group separately, using the SNP genotype results from the 15, 50, and 65 loci (FIG. 9). If each ethnic group was genetically similar for the loci studied, then only one cluster should be present. However, the data show that within each ethnic group there are many genetic clusters. This is especially striking for the US Caucasian group.

In addition, the ethnic groups identified in each cluster, when theanalysis was done on the total population with 15 and 50 loci, do notfall into discrete clusters that result from the analysis with 15 or 50loci for each ethnic group (FIGS. 10-12). The question addressed by thisanalysis, is whether the individuals of the ethnic groups that comprisea part of each cluster identified when the total population wasanalyzed, identify single clusters from the clusters that resulted whenthe ethnic group was individually analyzed. In fact, the individualsfrom the different clusters resulting from the total population analysisfall in more than one cluster resulting from the ethnic group analysis.

Thus, the ethnic groups themselves are genetically diverse and the subgroups identified by the total population analysis do not fall into the same genetic subgroups. This is true even for the African American subgroup using 15 loci from Cluster 5 and Cluster 6 (FIG. 12). These African American individuals, who are the predominant group in these two clusters from the total population analysis, are found in almost all the clusters resulting from the clustering analysis of African Americans separately.

Example 2 Use of the Genetic Clustering Algorithm to Assess the Heterogeneity of the Populations in a CNS Case Study

A study was designed to identify genes that predispose to the development of a CNS disease. Candidate genes were selected because of their putative roles in the biology of the CNS disease. Five single nucleotide polymorphic (SNP) markers were developed from genomic regions containing these candidate genes and genotyped in both patient and control populations using the methods described in Example 1.

The patient population consisted of 140 predominantly Caucasian patients diagnosed with a CNS disease. Two populations were used as control groups. The first were 94 ethnically matched individuals who were ascertained to be free of psychiatric disease by history and psychiatric examination. The second group consisted of 95 ethnically matched individuals from the San Diego area who were reportedly healthy on the basis of history alone.

The genotypes were analyzed for differences in frequency between cases and controls as well as for allele frequency differences. Haplotype frequencies, based on combinations of SNP markers, were estimated in the case and control groups using expectation maximization algorithms and the significance of frequency differences was assessed by permutation analysis. This analysis indicated a significant frequency difference for one haplotype between the case population and the psychiatric disease-free control population. However, a similar analysis between the case population and the San Diego control revealed no significant difference. Therefore, the biological significance of this haplotype frequency difference was unclear.

Because this method of haplotype frequency estimation assumes population homogeneity, the cluster analysis tool of the invention was used to assess genetic diversity both in the combined population of case and control groups, as well as within the case and control groups separately.

When the combined population was analyzed for heterogeneity using thegenetic clustering analysis of the invention (Detailed Description ofthe Invention) on the genotyping results from 5 loci, eight clusterswere identified, not the two that would correspond to the case andcontrol groups (FIG. 13). For each cluster, the proportion of case andcontrol individuals was then determined (FIG. 14). Basically, all theclusters have approximately the same proportion of individuals from caseand control except Cluster 1 and Cluster 8 (FIG. 14). Cluster 1 has ahigher proportion of individuals from the case population, and a lowerproportion of individuals from the matched control population, with theunmatched control population falling in the middle. In contrast, Cluster8 has a higher proportion of matched control individuals, compared tocase individuals, with the unmatched control group falling in themiddle.

These data suggest that individuals in Clusters 2 through 7 are not genetically different for the loci studied, and that at least these loci are not linked to a potential genetic basis for the CNS disease. In contrast, the differences seen in Clusters 1 and 8 indicate that a subpopulation of individuals may be genetically distinct at these loci, and that these loci may be linked to a potential genetic basis for the CNS disease.

Subsequently, the homogeneity of the case and control groups individually was assessed using the genetic clustering analysis of the invention, and the groups were found to be heterogenous (FIGS. 15-17). Then, the location of the case and control individuals separated into clusters in the total population was determined in the clusters from the analysis of the case and control populations separately (FIGS. 15-17). Interestingly, and in contrast to the ethnic diversity study (Example 1), the individual case and control clusters from the combined population generally identified single clusters generated from analyzing the case and control data separately.

1. A method of determining the optimal number of clusters of members in a sample, comprising: a) applying a hierarchical clustering algorithm to said members of said sample; and b) determining the optimal number of clusters based on the results of said hierarchical clustering algorithm.

2. The method of claim 1, wherein said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

3. A computer system for determining the optimal number of clusters in a sample, comprising: a) a first module configured to apply a hierarchical clustering algorithm to the members of said sample; and b) a second module configured to determine the optimal number of clusters based on the results of said hierarchical clustering algorithm.

4. The computer system of claim 3, wherein said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.

5. A programmed storage device comprising instructions that when executed perform a method for determining the number of clusters in a sample, comprising: a) applying a hierarchical clustering algorithm to the members of said sample; and b) determining the optimal number of clusters based on the results of said hierarchical clustering algorithm.

6. The programmed storage device of claim 5, wherein said optimal number of clusters is found where the average pairwise intracluster similarity for two parent clusters is larger than the average pairwise intracluster similarity for a new cluster.